This directory contains all code used for the Udacity Data Scientist Nanodegree Program.
For this project, the problem statement given to us is to develop an algorithm to predict the default of Home Credit customers.
Project Summary: Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
In this project, we ask you to complete the analysis of which Home Credit customers were likely to default. In particular, we ask you to apply the tools of machine learning to predict which customers defaulted.
Project Metrics: From a credit-risk perspective, default customers should be predictable using as few variables as possible, so the selected model specification must be explainable and applicable.
Practice Skills
The dataset is given to us as train and test data in Kaggle's Home Credit Default Risk competition.
The following code is written in Python 3.x. Libraries provide pre-written functionality to perform the necessary tasks.
We will use the popular scikit-learn library to develop our machine learning algorithms; for data visualization, we will use the matplotlib and seaborn libraries. Below are common classes to load.
To begin this step, the data is imported first. Next, we use the info() and head() functions to get a quick and dirty overview of variable datatypes (i.e. qualitative vs. quantitative). Click here for the Source Data Dictionary.
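As a minimal sketch, the common imports for this kind of analysis might look like the following (the exact set loaded in the notebook is an assumption):

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, log_loss, roc_curve, auc,
                             classification_report, confusion_matrix)
```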
# train_df
# preview the data
train_df.head(10)
# train_df
#data info
train_df.info(max_cols=1000)
# train_df
# data describe
train_df.describe()
# train_df
# data describe for object
categorical_variable = train_df.describe(include=['O'])
categorical_variable
What is the distribution of categorical features?
barchart(train_df,'CODE_GENDER')
barchart(train_df,'NAME_CONTRACT_TYPE')
catplot_WTARGET(train_df,'CODE_GENDER','TARGET')
catplot_WTARGET(train_df,'NAME_CONTRACT_TYPE','TARGET')
barchart(train_df,'TARGET')
barchart(train_df,'FLAG_OWN_CAR')
catplot_WTARGET(train_df,'FLAG_OWN_CAR','TARGET')
barchart(train_df,'FLAG_OWN_REALTY')
sns.scatterplot(x="AMT_CREDIT", y="AMT_INCOME_TOTAL" , data=train_df)
catplot_WTARGET(train_df,'FLAG_OWN_REALTY','TARGET')
catplot_WTARGET(train_df,'NAME_TYPE_SUITE','TARGET')
piechart(train_df,'NAME_TYPE_SUITE')
piechart(train_df,'NAME_INCOME_TYPE')
piechart(train_df,'NAME_EDUCATION_TYPE')
piechart(train_df,'NAME_HOUSING_TYPE')
In this stage, the data should be cleaned.
Correcting: Reviewing the data, we analyze whether there are any abnormal or unacceptable inputs. In addition, age and income may contain outlier values. Exploratory analysis will be done to find reasonable values, and outliers will be eliminated from the dataset. It should be noted that clearly unreasonable values (for example, an age of 1000) must also be eliminated.
Completing: There are null values or missing data in the dataset. Missing values can be a problem because some algorithms don't know how to handle them and will fail, while others, like decision trees, can. It is therefore important to fix them before modeling starts, since several models will be compared. There are two common methods: either delete the record or populate the missing value with a reasonable input. Deleting records is not recommended, especially for a large percentage of records, unless a record is truly incomplete. Instead, it is best to impute missing values. A basic methodology for qualitative data is to impute using the mode; for quantitative data, impute using the mean, the median, or the mean plus a randomized standard deviation.
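As an illustration of these imputation rules (the toy values below are made up for the example), a basic pandas approach might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OCCUPATION_TYPE": ["Laborers", None, "Core staff", "Laborers"],  # qualitative
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, np.nan, 67500.0],        # quantitative
})

# Qualitative: impute with the mode (most frequent category)
mode_value = df["OCCUPATION_TYPE"].mode()[0]
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna(mode_value)

# Quantitative: impute with the median (robust to income outliers)
median_value = df["AMT_INCOME_TOTAL"].median()
df["AMT_INCOME_TOTAL"] = df["AMT_INCOME_TOTAL"].fillna(median_value)
```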
Creating: Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome.
Converting: Last, but certainly not least, we'll deal with formatting. There are no date or currency formats, only datatype formats. Our categorical data was imported as objects, which makes mathematical calculations difficult. For this dataset, we will convert object datatypes to categorical dummy variables.
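For example, pd.get_dummies converts object columns into 0/1 indicator variables while passing numeric columns through unchanged (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"CODE_GENDER": ["F", "M", "F"],
                   "AMT_CREDIT": [406597.5, 1293502.5, 135000.0]})

# Object/categorical columns become indicator columns such as
# CODE_GENDER_F and CODE_GENDER_M; AMT_CREDIT is kept as-is.
df_encoded = pd.get_dummies(df)
print(df_encoded.columns.tolist())
```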
We analyzed the dataset for abnormal values. The maximum of the count-of-children variable is 19, which is an outlier, so records with count of children = 19 were eliminated.
We did not see any other anomalies; we checked this step across the whole dataset.
We can analyze the missing values, but some of them cannot be completed, because some values are legitimately missing. For example, customers who have no credit bureau information will have no values in the related columns. We filter out the OCCUPATION_TYPE variable because it has 96,391 missing values.
The DAYS_EMPLOYED variable divided by the DAYS_BIRTH variable gives the calculated days_employed_perc feature in both the train and test datasets.
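A sketch of this engineered feature (the sample values below are illustrative; in Home Credit, DAYS_BIRTH and DAYS_EMPLOYED are negative day counts relative to the application date):

```python
import pandas as pd

df = pd.DataFrame({"DAYS_EMPLOYED": [-1188, -1200],
                   "DAYS_BIRTH": [-9461, -16765]})

# Fraction of the client's life spent employed; the signs cancel,
# so the ratio is positive.
df["days_employed_perc"] = df["DAYS_EMPLOYED"] / df["DAYS_BIRTH"]
```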
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot(x="CODE_GENDER", y="AGE_CAL", hue="TARGET", data=train_df, split=True, ax=ax[0])
ax[0].set_title('CODE_GENDER and AGE_CAL vs TARGET')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="NAME_CONTRACT_TYPE", y="AGE_CAL", hue="TARGET", data=train_df, split=True, ax=ax[1])
ax[1].set_title('NAME_CONTRACT_TYPE and AGE_CAL vs TARGET')
ax[1].set_yticks(range(0, 110, 10))
plt.show()
These graphs show that age grouping is needed. We think the age groups should be as below.
# https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
def f(row):
    if row['AGE_CAL'] < 30:
        AGE_BIN = 1
    elif row['AGE_CAL'] < 45:
        AGE_BIN = 2
    else:
        AGE_BIN = 3
    return AGE_BIN

train_df['AGE_BIN'] = train_df.apply(f, axis=1)
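A vectorized equivalent of the row-wise function above, using pd.cut (toy frame for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"AGE_CAL": [22, 35, 61]})

# Same bins as the row-wise function: <30 -> 1, 30-44 -> 2, >=45 -> 3.
# right=False makes the intervals [-inf, 30), [30, 45), [45, inf).
train["AGE_BIN"] = pd.cut(train["AGE_CAL"],
                          bins=[-np.inf, 30, 45, np.inf],
                          labels=[1, 2, 3],
                          right=False).astype(int)
```

This avoids the per-row Python overhead of DataFrame.apply on large frames.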
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot(x="AGE_BIN", y="DAYS_EMPLOYED", hue="TARGET", data=train_df, split=True, ax=ax[0])
ax[0].set_title('DAYS_EMPLOYED and AGE_BIN vs TARGET')
plt.show()
def density_plot(df, variable):
    plt.figure(figsize=(10, 8))
    # KDE plot of loans that were repaid on time
    sns.kdeplot(df.loc[df['TARGET'] == 0, variable], label='target == 0')
    # KDE plot of loans that were not repaid on time
    sns.kdeplot(df.loc[df['TARGET'] == 1, variable], label='target == 1')
    # Label the plot
    plt.xlabel(variable); plt.ylabel('Density'); plt.title(variable)
density_plot(train_df,'AMT_CREDIT')
train_df['TARGET'].value_counts()
We analyze the correlation of every variable with the target and will choose those with correlation higher than 0.05 or lower than -0.05. Correlations are very useful in many applications, especially when conducting regression analysis. However, correlation should not be confused with causality or misinterpreted in any way. We should also always check the correlations between the different variables in our dataset and gather insights as part of exploration and analysis.
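A sketch of this selection rule on synthetic data (the correlation_heatmap helper itself is defined elsewhere in the notebook; column names here are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, n)
df = pd.DataFrame({
    "TARGET": target,
    "SIGNAL": target * 0.5 + rng.normal(0, 1, n),  # correlated with TARGET
    "NOISE": rng.normal(0, 1, n),                  # uncorrelated
})

# Correlation of every variable with TARGET
corr = df.corr()["TARGET"].drop("TARGET")

# Keep variables with |correlation| above the 0.05 cut-off
selected = corr[corr.abs() > 0.05].index.tolist()
print(selected)
```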
correlation_heatmap(train_df_v13)
correlation_heatmap(train_df_v1)
REGION_RATING_CLIENT and DAYS_ID_PUBLISH have correlations higher than 0.05, so these 2 variables are selected as final variables.
correlation_heatmap(train_df_v2)
REGION_RATING_CLIENT_W_CITY, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 have correlations higher than 0.05, so these 4 variables are selected as final variables.
correlation_heatmap(train_df_v3)
correlation_heatmap(train_df_v4)
DAYS_LAST_PHONE_CHANGE has a correlation higher than 0.05, so this 1 variable is selected as a final variable.
correlation_heatmap(train_df_v5)
correlation_heatmap(train_df_v6)
AGE_CAL, CODE_GENDER_F, and CODE_GENDER_M have correlations higher than 0.05, so these 3 variables are selected as final variables.
correlation_heatmap(train_df_v7)
correlation_heatmap(train_df_v8)
correlation_heatmap(train_df_v9)
correlation_heatmap(train_df_v10)
NAME_EDUCATION_TYPE_Secondary / secondary special has a correlation higher than 0.05, so this 1 variable is selected as a final variable.
correlation_heatmap(train_df_v11)
correlation_heatmap(train_df_v12)
NAME_INCOME_TYPE_Working has a correlation higher than 0.05, so this 1 variable is selected as a final variable.
Finally, NAME_INCOME_TYPE_Working, NAME_EDUCATION_TYPE_Secondary / secondary special, AGE_CAL, CODE_GENDER_F, CODE_GENDER_M, DAYS_LAST_PHONE_CHANGE, REGION_RATING_CLIENT_W_CITY, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 are selected as the final variables.
final_list = ['NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'AGE_BIN', 'CODE_GENDER_F', 'CODE_GENDER_M', 'DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT_W_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TARGET']
train_df_final_list = train_df[final_list]
correlation_heatmap(train_df_final_list)
final_list_V1 = ['NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'AGE_BIN', 'CODE_GENDER_F', 'CODE_GENDER_M', 'DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT_W_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
train_df_final_list_V1 = train_df[final_list_V1]
correlation_heatmap(train_df_final_list_V1)
final_list_V2_with_target = ['NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F', 'DAYS_LAST_PHONE_CHANGE', 'TARGET']
final_list_V2 = ['NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F', 'DAYS_LAST_PHONE_CHANGE']
test_final_list_V2 = ['NAME_EDUCATION_TYPE_Secondary / secondary special', 'AGE_CAL', 'CODE_GENDER_F', 'DAYS_LAST_PHONE_CHANGE']
train_df_final_list_V2 = train_df[final_list_V2]
test_df_final_list_V2 = test_df[test_final_list_V2]
train_df_final_list_V2_with_target = train_df[final_list_V2_with_target]
correlation_heatmap(train_df_final_list_V2_with_target)
The data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We hold out the test dataset (or subset) in order to evaluate our model's predictions on it. To reduce the risk of overfitting to one particular split, we can also perform cross validation, which is very similar to a train/test split but applied across more subsets. I decided on the split sizes shown below.
References: https://tarangshah.com/blog/2017-12-03/train-validation-and-test-sets/
In the literature, logistic regression is usually used for credit risk modeling, so I selected a logistic regression modelling approach. If accuracy and precision are lower than expected, we will try more machine learning methodologies.
References: https://smartdrill.com/pdf/Credit%20Risk%20Analysis.pdf
In the literature, modelling results are compared using the AUC-ROC score and precision, so we focus on those. The AUC-ROC curve is one of the most important evaluation metrics for checking any classification model's performance: it measures performance across various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability; it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).
In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).
It is also important for us to know what percentage of predicted defaults are true defaults. References: https://en.wikipedia.org/wiki/Precision_and_recall#F-measure, https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
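These definitions can be checked on a toy example in pure NumPy (the labels and scores below are made up; the rank-sum formula is the Mann-Whitney view of AUC):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])               # hard labels
y_score = np.array([0.1, 0.6, 0.8, 0.7, 0.2, 0.4])  # predicted probabilities

# Precision = TP / (TP + FP); Recall = TP / (TP + FN)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC via the Mann-Whitney rank-sum formula (no score ties here)
ranks = y_score.argsort().argsort() + 1  # ranks 1..n
n_pos = y_true.sum()
n_neg = len(y_true) - n_pos
auc_value = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Here 2 of the 3 positive predictions are correct (precision 2/3), 2 of the 3 actual positives are found (recall 2/3), and 8 of the 9 positive/negative score pairs are correctly ordered (AUC 8/9).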
#Model Alternative 1
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=20)
#old
# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))
idx = np.min(np.where(tpr > 0.95))  # index of the first threshold for which the sensitivity > 0.95
plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0, fpr[idx]], [tpr[idx], tpr[idx]], '--', color='blue')
plt.plot([fpr[idx], fpr[idx]], [0, tpr[idx]], '--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_pred))
print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +
"and a specificity of %.3f" % (1-fpr[idx]) +
", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
#Model Alternative 1
#Cross Validation
# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_val)
y_pred_proba = logreg.predict_proba(X_val)[:, 1]
[fpr, tpr, thr] = roc_curve(y_val, y_pred_proba)
print('Validation set results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_val, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_val, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))
idx = np.min(np.where(tpr > 0.95))  # index of the first threshold for which the sensitivity > 0.95
plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0, fpr[idx]], [tpr[idx], tpr[idx]], '--', color='blue')
plt.plot([fpr[idx], fpr[idx]], [0, tpr[idx]], '--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_val, y_pred))
print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +
"and a specificity of %.3f" % (1-fpr[idx]) +
", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))
cm = confusion_matrix(y_val, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Our model did not predict the default customers well, so we will need to try a new machine learning approach and a new dataset approach.
train_df['TARGET'].value_counts()
We are sampling 50:50: our balanced dataset contains 24,825 default customers and 24,825 non-default customers, and we will train the alternative models on it. References: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
default = train_df_final_list_V2_with_target[train_df_final_list_V2_with_target['TARGET']==1]
#nondefault = train_df_final_list_V2[train_df_final_list_V2.TARGET=="0"]
nondefault = train_df_final_list_V2_with_target[train_df_final_list_V2_with_target['TARGET']==0]
train_df['TARGET'].value_counts()
# We randomly select 24825 non-default customers
nondefault_sub = nondefault.sample(24825, random_state=25)
# dataset_sub is the dataset composed of 24825 non-default and 24825 default customers
dataset_sub = pd.concat([default, nondefault_sub], ignore_index=True)
print('This sub dataset contains ',dataset_sub.shape[0],'rows')
print('This sub dataset contains ',dataset_sub.shape[1],'columns')
dataset_sub_wio_Target=dataset_sub.drop(['TARGET'], axis=1)
X=dataset_sub_wio_Target
y=dataset_sub['TARGET']
seed = 100
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=seed)
# Proportion of default in train set and test set
print('Proportion of default in train:',y_train[y_train == True].shape[0]/X_train.shape[0])
print('Proportion of default in test:',y_test[y_test == True].shape[0]/X_test.shape[0])
print('Proportion of default in validation:',y_val[y_val == True].shape[0]/X_val.shape[0])
# Evaluation of each model
for name, model in models:
    print('----------', name, '----------')
    get_score_models(model, X_train, X_test, y_train, y_test)

# Evaluation of each ensemble method
for name, ensemble in ensembles:
    print('----------', name, '----------')
    get_score_ensembles(ensemble, X_train, X_test, y_train, y_test)
QuadraticDiscriminantAnalysis has the best precision and area-under-the-curve scores among the base models.
GradientBoostingClassifier has the best precision and area-under-the-curve scores among the ensemble methods.
QuadraticDiscriminantAnalysis is the best model for results_precision and results_auc on the k-fold validation data, so QuadraticDiscriminantAnalysis is the best model for this sampled dataset.
Now, we evaluate the performance of our classifiers with a 10-Fold cross validation.
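The cross_validation helper used below is defined earlier in the notebook; as an assumption, a minimal version with StratifiedKFold (run here on made-up toy data standing in for X_train / y_train) might look like:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy data standing in for X_train / y_train (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

def cross_validation(name, model, scores, results_precision, results_auprc):
    """Evaluate a model with stratified 10-fold CV on precision and AUPRC."""
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    prec = cross_val_score(model, X, y, cv=kfold, scoring='precision')
    auprc = cross_val_score(model, X, y, cv=kfold, scoring='average_precision')
    scores.append((name, prec.mean(), auprc.mean()))
    results_precision.append(prec)
    results_auprc.append(auprc)

models_score, results_precision, results_aupcr = [], [], []
cross_validation('LogisticRegression', LogisticRegression(), models_score,
                 results_precision, results_aupcr)
```

Stratified folds keep the default/non-default ratio roughly constant in every fold, which matters for the imbalanced target here.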
# 10-Fold cross validation on our models
for name, model in models:
    cross_validation(name, model, models_score, results_precision, results_aupcr)

# 10-Fold cross validation on ensembles
for name, ensemble in ensembles:
    cross_validation(name, ensemble, ensembles_score, results_precision, results_aupcr)
# Compare Classifiers regarding Precision
fig = plt.figure()
fig.suptitle('Classifiers Precision Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_precision)
ax.set_xticklabels(names)
plt.show()
# Compare Classifiers regarding the Precision
fig = plt.figure()
fig.suptitle('Classifiers AUPRC Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_aupcr)
ax.set_xticklabels(names)
plt.show()
QuadraticDiscriminantAnalysis is the best model for results_precision and results_auc.
We selected QuadraticDiscriminantAnalysis because it has the best precision and area-under-the-curve scores.
Precision: 0.6044389130885708, Area under the curve: 0.6088810563323734
The model variables are:
My expectation was that credit type, credit amount, or income type would be among the final modelling variables, but these variables were eliminated in the correlation step, which surprised me. This project aimed at end-to-end data processing and data modelling on credit-risk data, and I enjoyed analyzing and creating it.